CACHING IN MULTIDIMENSIONAL 
DATABASES* 



Istvan Szepkuti 

ING Insurance Hungary Pte. Co. Ltd. 
H-1068 Budapest, Dozsa Gyorgy lit 84/b, Hungary 
e-mail: szepkuti@inf.u-szeged.hu 

Received: September 12, 2006 



Abstract 

One utilisation of multidimensional databases is the field of On-line 
Analytical Processing (OLAP). The applications in this area are designed 
to make the analysis of shared multidimensional information fast [9]. 

On one hand, speed can be achieved by specially devised data struc- 
tures and algorithms. On the other hand, the analytical process is cyclic. 
In other words, the user of the OLAP application runs his or her queries 
one after the other. The output of the last query may be there (at least 
partly) in one of the previous results. Therefore caching also plays an 
important role in the operation of these systems. 

However, caching itself may not be enough to ensure acceptable per- 
formance. Size does matter: The more memory is available, the more we 
gain by loading and keeping information in there. 

Oftentimes, the cache size is fixed. This limits the performance of 
the multidimensional database, as well, unless we compress the data in 
order to move a greater proportion of them into the memory. Caching 
combined with proper compression methods promise further performance 
improvements. 

In this paper, we investigate how caching influences the speed of OLAP 
systems. Different physical representations (multidimensional and table) 
are evaluated. For the thorough comparison, models are proposed. We 
draw conclusions based on these models, and the conclusions are verified 
with empirical data. In particular, using benchmark databases, we show 
examples when one physical representation is more beneficial than the 
alternative one and vice versa. 

Keywords: compression, caching, multidimensional database, On-line An- 
alytical Processing, OLAP. 



*Web: http://www.pp.bme.hu/ee/2007_3/pdf/ce2007_3_06.pdf Journal reference: Peri- 
odica Polytechnica Electrical Engineering, Vol. 51, Number 3-4, pp. 119-132, 2007; DOI: 
10.3311/pp.ee.2007-3-4.06. 



1 



2 



I. SZEPKUTI 



1 Introduction 

1.1 Motivation 

Why is it important to investigate the caching effects in multidimensional data- 
bases? 

A number of papers compare the different physical representations of data- 
bases in order to find the one resulting in higher performance than others. For 
examples, see [I1[TTJ[T2J[T2J[T31[2T]. However, many of these papers either ignore 
the influence of caching or discusses this issue very briefly. 

As it will be shown later, the size of the buffer cache affects the results 
significantly. Hence the thorough analysis of the buffering is necessary in order 
to better understand what is the real reason of the performance improvements. 

1.2 Results 

The results of this paper can be summarized as follows: 

• Two models are proposed to analyse the caching effects of the alternative 
physical representations of relations. 

• With the help of the models, it is shown that the performance difference 
between the two representations can be several orders of magnitude de- 
pending on the size of the buffer cache. 

• It is also demonstrated that the generally better multidimensional physical 
representation may become worse, if the memory available for caching is 
large enough. 

• The models are verified by a number of experiments. 

1.3 Related Work 

In the literature, several papers deal with compressed databases: For further 
details the reader may wish to consult [U [SJ [7J [TSJ QI5] . 

The paper of Westmann et al. [18] lists several related works in this field. 
It also discusses how compression can be integrated into a relational database 
system. It does not concern itself with the multidimensional physical representa- 
tion, which is the main focus of our paper. They demonstrate that compression 
indeed offers high performance gains. It can, however, also increase the running 
time of certain update operations. In this paper we will analyse the retrieval (or 
point query) operation only, as a lot of On-line Analytical Processing (OLAP) 
applications handle the data in a read only or read mostly way. The database 
is updated outside working hours in batch. Despite this difference, we also en- 
countered performance degradation due to compression when the entire physical 
representation was cached into the memory. In this case, at one of the bench- 
mark databases (TPC-D), the multidimensional representation became slower 
than the table representation because of the CPU-intensive Huffman decoding. 

In this paper, we use difference - Huffman coding to compress the multidi- 
mensional physical representation of the relations. This method is based on 
difference sequence compression, which was published in [13] . 
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Chen et al. [2] propose a Hierarchical Dictionary Encoding and discusses 
query optimization issues. Both of these topics are beyond the scope of our 
paper. 

In the article of O'Connell et al. J , compressing of the data itself is 
analysed in a database built on a triple store. We remove the empty cells from 
the multidimensional array, but do not compress the data themselves. 

When we analyse algorithms that operate on data on the secondary storage, 
we usually investigate how many disk input/output (I/O) operations are per- 
formed. This is because we follow the dominance of the 1/ O cost rule [3] . We 
followed a similar approach in Section [3] below. 

The main focus of PQ is the CPU cache. In our paper, we deal with the 
buffer cache as opposed to the CPU cache. 

Vitter et al. [T7] describe an algorithm for prefetching based on compres- 
sion techniques. Our paper supposes that the system does not read ahead. 

POESS et al. 10 show how compression works in Oracle. They do not test 
the performance for different buffer cache sizes, which is an important issue in 
this paper. 

In [20 , Xi et al. predict the buffer hit rate using a Markov chain model for a 
given buffer pool size. In our article, instead of the buffer hit rate, we estimate 
the expected number of pages brought into the memory from the disk, because 
it is proportional to the retrieval time. Another difference is that we usually 
start with a cold (that is empty) cache and investigate its increase together with 
the decrease in retrieval time. In [2D], the authors fix the size of the buffer pool 
and then predict the buffer hit rate with the Markov chain model. 

1.4 Organisation 

The rest of the paper is organised as follows. Section [2J describes the different 
physical representations of relations including two compression techniques used 
for the multidimensional representation. Section [3] introduces a model based 
on the dominance of the I/O cost rule for the analysis of the caching effects. 
An alternative model is presented in Sectional The theoretical results are then 
tested in experiments outlined in Section [3] Section |BJ rounds off the discussion 
with some conclusions and suggestions for future study. Lastly, for the sake of 
completeness, a list of references ends the paper. 

2 Physical Representations of Relations 

Throughout this paper we use the expressions 'multidimensional representation' 
and 'table representation,' which are defined as follows. 

Definition 1. Suppose we wish to represent relation R physically. The multi- 
dimensional (physical) representation of R is as follows: 

• A compressed array, which only stores the nonempty cells, one nonempty 
cell corresponding to one element of R; 

• The header, which is needed for the logical-to-physical position transfor- 
mation; 

• One array per dimension in order to store the dimension values. 
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The table (physical) representation consists of the following: 

• A table, which stores every element of relation R; 

• A B-tree index to speed up the access to given rows of the table when the 
entire primary key is given. □ 

In the experiments, to compress the multidimensional representation, dif- 
ference-Huffman coding (DHC) was used, which is closely related to difference 
sequence compression (DSC) . These two methods are explained in the remainder 
of this section. 

Difference sequence compression. By transforming the multidimensional ar- 
ray into a one-dimensional array, we obtain a sequence of empty and nonempty 
cells: 

(E*F*)* 

In the above regular expression, E is an empty cell and F is a nonempty one. The 
difference sequence compression stores only the nonempty cells and their logical 
positions. (The logical position is the position of the cell in the multidimensional 
array before compression. The physical position is the position of the cell in 
the compressed array.) We denote the sequence of logical positions by Lj. This 
sequence is strictly increasing: 

Lq < L\ < ■ ■ ■ < Ljf-i- 

In addition, the difference sequence ALj contains smaller values than the orig- 
inal Lj sequence. (See also Definition [2] below.) 

The search algorithm describes how we can find an element (cell) in the 
compressed array. During the design of the data structures of DSC and the 
search algorithm, the following principles were used: 

• We compress the header in such a way that enables quick decompression. 

• It is not necessary to decompress the entire header. 

• Searching can be done during decompression, and the decompression stops 
immediately when the header element is found or when it is demonstrated 
that the header element cannot be found (that is, when the corresponding 
cell is empty). 

Definition 2. Let us introduce the following notations. 

N is the number of elements in the sequence of logical positions (N > 0); 

Lj is the sequence of logical positions (0 ^ j ^ N — I); 

AL = L Q ; 

ALj=Lj-Lj^ (i = l,2,...,iV-l); 

The Di sequence {Di £ {0, 1, . . . , D}, i = 0, 1, . . . , N — 1) is defined as follows: 

D _ ( AL,, if ALi ^ D and i > 0; 
1 \ 0, otherwise; 

where D = 2 s — 1, and s is the size of a Di sequence element in bits. 
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The Jk sequence will be denned recursively in the following way: 

j ={ L °> iffc = °; _ 

Lj, otherwise where j = min{i | AL, > D and Li > J/-_i}. 

Here the Di sequence is called the overflow difference sequence. There is an 
obvious distinction between ALi and Di, but the latter will also be called the 
difference sequence, if it is not too disturbing. Jk it is called the jump sequence. 
The compression method which makes use of the Dj and Jk sequences will be 
called difference sequence compression (DSC). The Di and Jk sequences together 
will be called the DSC header. □ 

Notice here that ALi and Di are basically the same sequence. The only 
difference is that some elements of the original difference sequence ALi are re- 
placed with zeros, if and only if they cannot be stored in s bits. (The symbol 
s denotes a natural number. The theoretically optimal value of s can be deter- 
mined, if the distribution of ALi is known. In practice, for performance reasons, 
s is either 8 or 16 or 32.) 

The difference sequence will also be called the relative logical position se- 
quence, and we shall call the jump sequence the absolute logical position se- 
quence. 

From the definitions of Di and Jk, one can see clearly that, for every zero el- 
ement of the Di sequence, there is exactly one corresponding element in the 
Jk sequence. For example, let us assume that Dq = D3 = = 0, and 
D\, D2, D4, Dq, D7, Dg > 0. Then the above mentioned correspondence is shown 
in the following table: 





Di 


D 2 


D 3 


D A 


D 5 


D e 


D 7 


D s 




Jo 






Ji 




J2 











From the above definition, the recursive formula below follows for Lj. 

, f Lj-i+Dj, Wlh O: 
3 1 Jk, otherwise where k = min{i | Ji > Lj-\). 

In other words, every element of the Lj sequence can be calculated by adding 
zero or more consecutive elements of the Di sequence to the proper jump se- 
quence element. For instance, in the above example 

La = Jo', 

L x = J0 + D1, 

L 2 = Jo + D l +D 2 ; 

L3 = Ji', 

£4 = Ji + D4; 

and so on. 

A detailed analysis of DSC and the search algorithm can be found in [TJ] . 

Difference - Huffman coding. The key idea in difference - Huffman coding is 
that we can compress the difference sequence further if we replace it with its 
corresponding Huffman code. 
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Definition 3. The compression method, which uses the jump sequence ( J&) and 
the Huffman code of the difference sequence (Di), will be labelled difference - 
Huffman coding (DHC). The sequence and the Huffman code of the Di 
sequence together will be called the DHC header. □ 

The difference sequence usually contains a lot of zeros. Moreover, it contains 
many ones too if there are numerous consecutive elements in the Lj sequence 
of logical positions. By definition, the elements of the difference sequence are 
smaller than those of the logical position sequence. The elements of Dj will 
recur with greater or less frequency. Hence it seems reasonable to code the 
frequent elements with fewer bits, and the less frequent ones with more. To 
do this, the optimal prefix code can be determined by the well-known Huffman 
algorithm [5]. 

3 A Model Based on the Dominance of the I/O 
Cost Rule 

During our analysis of caching effects, we followed two different approaches: 

• For the first model, we applied the dominance of the I/O cost rule to 
calculate the expected number of I/O operations. 

• In the second one, instead of counting the number of disk inputs/outputs, 
we introduced two different constants: D m and D t . The constant D m 
denotes the time needed to retrieve one cell from the disk, if the multidi- 
mensional representation is used. The constant D t shows the time required 
to read one row from the disk, if the table representation is used. The con- 
stants were determined experimentally. The tests showed that D m <C D t , 
that is more disk I/O operations are needed to retrieve one row from the 
table representation than one cell from the multidimensional representa- 
tion which is obvious when there is no caching. However, for the second 
model, it was not necessary to compute the exact number of I/O opera- 
tions for the alternative physical representations due to the experimental 
approach. 

The first model is described in this section, whereas the second model in the 
next one. 

Throughout the paper, we suppose that the different database pages are 
accessed with the same probability. In other words, uniform distribution will 
be assumed. 

It is not hard to see that this assumption corresponds to the worst case. 
If the distribution is not uniform, then certain partitions of the pages will be 
read/written with higher probability than the average. Therefore it is more 
likely to find pages from these partitions in the buffer cache than from other 
parts of the database. Hence the non-uniform distribution increases the buffer 
hit rate and thus the performance. 

We are going to estimate the number of database pages (blocks) in the buffer 
cache. First it will be done for the multidimensional representation, then for 
the table representation. 
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Multidimensional physical representation. In this paper, we shall assume 
that prefetching is not performed by the system. Hence, for the multidimen- 
sional representation, one or zero database page has to be copied from the disk 
into the memory, when a cell is accessed. This value is one if the needed page 
is not in the buffer cache, zero otherwise. 

The multidimensional representation requires that the header and the di- 
mension values are preloaded into the memory. The total size of these will be 
denoted by H. The compressed multidimensional array can be found on the 
disk. The pages of the latter are gradually copied into the memory as a result 
of caching. Thus the total memory occupancy of this representation can be 
computed by adding H to the size of the buffer cache. 

Definition 4. In this section, for the multidimensional representation, we shall 
use the following notation. 

N is the number of pages required to store the compressed array (N ^1); 

Bi is the expected value of the number of pages in the buffer cache after the i th 

database access (i ^ 0). □ 

Theorem 1. Suppose that Bk is less than the size of the memor$\ available 
for caching for every k G {0,1,..., i} index. In addition, let us assume that 
the buffer cache is 'cold' initially, i.e. Bq = 0. Then, for the multidimensional 
representation, 

B, = X I I - ' ' 




Proof. The theorem will be proven by induction. For convenience, let us dehne 
d as follows: 

N 



For i = 0, the theorem holds H 




Now assume that the theorem has already been proven for i — 1: 

Bi-. x = N (1 - d 1 - 1 ) . 

Then for i we obtain that 

Bi = + x -JLi + i x 

Because of the uniform distribution, Bi ~ 1 is the probability that the required 
database block can be found in the memory. Zero new page will be copied from 
the disk into the buffer cache in this situation. However, in the opposite case, 
one new page will be brought into the memory. This will occur with probability 
A u In other words, the expected value of the increase is 



N 



Bi-i N — N-Bi-! Bi-x 
Ox— — + lx — = — = 1 — . 1 

X X X X K ' 



1 Please note that the memory size is also measured in pages in this section. 

2 We define 0° as 1. In this way, the theorem remains true for the special case of JV = 1. 
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Hence 

B t = Bi-i + 1 - = M - -ij + 1 = B^d + 1. 

From the induction hypothesis follows that 

Bi = N (l - a 4 ' 1 ) d + 1. 

It is easy to see that 

= 1 + d + d 2 + d 3 + ■ ■ • + d i_1 = ^—4 =N(l-rf). 

1 — d 

The last formula can be written as 

Bi = N I 1 - ( 1 - — 
which proves the theorem. 



The time to retrieve one cell from the multidimensional representation is 
proportional to the number of pages brought into the memory. The latter is a 
linear function of the size of the buffer cache. This is rephrased in the following 
theorem. 

Theorem 2. Assume that the number of database pages in the buffer cache is 
B. The memory available for caching is greater than B. Let us suppose that 
a cell is accessed in the multidimensional representation. Then the expected 
number of pages copied from the disk into the memory is 

N 

Proof. Similarly to Equation ([1]), the expected number of pages necessary for 
this operation is 

B N-B N-B B 
Ox hi x = = 1 . 

N N N N 



Remark 1. The above theorem holds even if B is equal to the number of pages 
available for caching. However, in this case, the database management system 
(or the operating system) has to remove a page from the buffer cache, if a page 
fault happens. If the removed page is 'dirty,' then it has to be written back 
to the disk in order not to lose the modifications. That is why another disk 
I/O operation is needed. In this paper, we are going to ignore these situations, 
because most OLAP applications handle the data in a read only or read mostly 
way. 

Figure [1] illustrates the behaviour of the multidimensional representation. 
The horizontal axis shows the number of pages in the buffer cache. The vertical 
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Figure 1: The expected number of pages copied from the disk into the memory, 
if the multidimensional representation is used 




V(B) 



10,000 20,000 30,000 40,000 50,000 60,000 



one demonstrates the expected number of pages retrieved from the disk. The 
f(B) function is defined as follows: 



f(B) 



B 
N' 



Table physical representation. Now, let us turn to the other storage method, 
the table representation. Both the table and B-tree index are kept on the disk. 
The table itself could be handled similarly to the compressed array, but the B- 
tree index is structured differently. It consists of several levels. In our model, we 
are going to consider these levels separately. To simplify the notation, the table 
will also be considered as a separate level. The following definition introduces 
the necessary notations. 

Definition 5. L ^ 2 is the number of levels in the table representation. On 
level 1, the root page of the B-tree can be found. Level L — 1 is the last level of 
the B-tree, which contains the leaf nodes. Level L corresponds to the table. 
Nf ^ 1 is the number of pages on level I (1 ^ I ^ L). Specifically, N\ = 1, as 
there is only one root page. 
The total number of pages is 



N 



5>. 



(2) 



B^f 1 ^ is the number of pages in the buffer cache from level I after the i th 

database access (I ^ I ^ L and i ^ 0). 

The total number of pages in the buffer cache is 



Bi 



5>: 

e=i 



(3) 
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□ 

Theorem 3. Suppose that Bk is less than the size of the memory available for 
caching for every k S {0,1,..., i} index. In addition, let us assume that the 
buffer cache is cold initially: Bq = 0. Then, for the table representation, 



Proof. Observe that we can apply the result of Theorem [T] at each level: 




(4) 

The assertion of the theorem follows from the definitions of N and Bi shown in 
Equations © and ©: 

b, = n--£n,(i-±-' 



Similarly to the other representation, the necessary time to retrieve one row 
from the table representation is proportional to the number of pages brought 
into the memory. The next theorem investigates how the number of pages 
brought into the memory depends on the size of the buffer cache. 

Theorem 4. Assume that the number of database pages in the buffer cache is 
Bi = Ylf—i . The memory available for caching is greater than Bi. Let us 
suppose that a row is accessed in the table representation. Then the expected 
number of pages read from the disk into the memory is 

— B«> 



=1 



Proof. This will be shown by applying the result of Theorem [5] per level. For 
level i, the number of pages copied into the memory is: 



1 



BP 



Hence, for all levels in total, it is: 
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L, Ni, N2, ■ ■ ■ , Nl are constants. Therefore Equation (JSJ) is a linear function 
of b!> 2 ' , . . . , b\ L \ The same expression can be looked at as a function of 
Bi, as well: 

Definition 6. 

1=1 1 

□ 

Just like before, we are going to assume that the buffer cache is cold initially: 
Bq = 0. If this is the case, then Bq = for every £ £ {1,2,..., L}, because of 
Definition [5j Therefore, 

m)=L-±± = L. 

1=1 1 

In other words, one page per level has to be read into the memory at the first 
database access. If the memory available for caching is not smaller than L, then 
B\ — 1 for every £ and 

i=i e=i 

Obviously, we obtain the same, if we use the alternative (recursive) formula: 

B 1 = B + f{B ) = + L = L. 

Now, let us investigate the special case, when N m — max{7Vi, N2, ■ . • , A^} = 
1. Because of the latter, there is only one page per level (N\ = N2 = ■ • ■ = Ng, = 
1), which means that N also equals L. To put it into another way, the entire 
database is cached into the memory after the first database access, given that 
the available memory is greater than or equal to the size of the database. After 
this, there is no need to copy more pages into the memory: 

L (I) L 

f(Bi)=L-J2^=L-j: 1 -=L-L = 0. 

i=\ 1=1 

To summarise this paragraph, below we show the values of Bi and j(Bi) for 
every i: 

S = 0, 

B\ = B 2 = ■ ■ ■ = Bi = ■ ■ ■ = L, 
f(B ) - L, 

f(B ± ) = f(B 2 ) = --- = f(B l ) = ■■■ = (). 

In the remainder of this section, we shall assume that N m > 1. 

For sufficiently large i values, f{Bj) can be considered a linear function of 
Bi. This is the main idea behind the theorem below. 
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Theorem 5. Suppose that Bk is less than the size of the memory available for 
caching for every k £ {0, 1, . . . , i} index. In addition, let us assume that Bq = 0, 
Bi < N and f(Bi) =/= 0. Then, for the table representation, 

j\ B i) -> — Tf , ifl^OO, 

I'm 

where N m — max{7Vi, N 2 , ■ • • , Nl}. 
Proof. First, we show that 

N - B, 



f(Bi 



w, 



where Wi is a weighted average of constants N\, N%, ■ ■ ■ , Nl- Then we demon- 
strate that W{ tends to N m , if i tends to infinity. From Equation ((4]), we know 
that 



Sf N e -N e (l-±) , ! 



= 1-1- 



N e N e V N t 

Using Definition |6l we obtain that 



Theorem [3] implies the following equation: 

L 

N - Bi = Nt ( 1 

1=1 

Let us define Wi as follows: 



1 

N~ e 



given that the denominator is not zero (/(£>i) 7^ 0). Observe that Wi is a 
weighted average of constants Ni, N 2 , . . . , Nl- The weight of Ne is ^1 — 
for every £ £ {1,2,..., L}. With the previous definition, we get that 



/(Bi) 

If Wi does not vanish (Bi < N), then 

Finally, we have to prove that Wi — S- N m , if i — > 00. For every I G {1,2,... L}, 
the inequality 1 _ _ iV m holds. It is not difficult to see that 



('-*)' 



0, if iVe < and i -> 00. (6) 
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Figure 2: The expected number of pages copied from the disk into the memory, 
if the table representation is used 




Sf(Bi) 
\ Est 



50,000 100,000 150,000 200,000 250,000 300,000 350,000 



Obviously 



Nr 



1, if Nt = N m > 1. 



('-*) 

From Equations © and (J7J, it follows immediately, that 



(7) 



VK, = ^ 4f- = ^ — -> N m , if i -»■ cx). 



Figure [5] demonstrates the behaviour of the table representation. The hori- 
zontal axis is the number of pages in the buffer cache. The vertical one shows 
the expected number of pages retrieved from the disk. The Estimation denoted 
by 'Est.' in the chart is the limit of the f(Bi) function: 



Estimation ■ 



N-Bj 



We conclude this section by summarising the findings: 

• If we assume requests with uniform distribution, then the expected number 
of database pages brought into the memory at a database access is a linear 
function of the number of pages in the buffer cache. 

• Specifically, for the multidimensional representation, it equals 

B 



1 - 



N' 
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where B is the number of pages in the buffer cache and N is the size of 
the compressed multidimensional array in pages. 

• For the table representation, it is 

where L is the number of levels, B\ is the number of pages in the buffer 

cache from level £, Bi = J2i=i anc ^ i s tne total number of pages 
on level I. 

• The expression above is a linear function of fif \ . . . , b\ L) , but for 
large i values, it can be considered as a linear function of Bi, as well, 
because 

N-B t ... 
f(Bi) -> — , if * oo, 

where iV m = max{Aq, iV 2 , . . . , N L } and TV = N e- 



4 An Alternative Model 

In this section we shall examine how the caching affects the speed of retrieval 
in the different physical database representations. For the analysis, a model 
will be proposed. Then we will give sufficient and necessary conditions for such 
cases where the expected retrieval time is smaller in one representation than in 
the other. 

The caching can speed up the operation of a database management system 
significantly if the same block is requested while it is still in the memory. In 
order to show how the caching modifies the results of this paper, let us introduce 
the following notations. 

Definition 7. 

M = the retrieval time, if the information is in the memory, 
D = the retrieval time, if the disk also has to be accessed, 

p = the probability of having everything needed in the memory, 

Q = 1 - P, 

£ = how long it takes to retrieve the requested information. 

□ 

In our model we shall consider M and D constants. Obviously, £ is a random 
variable. Its expected value can be calculated as follows: 

E(£) = pM + qD. 

Notice that D docs not tell us how many blocks have to be read from the 
disk. This also means that the value of D will be different for the table and 
the multidimensional representations. The reason for this is that, in general, at 
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most one block has to be read with the multidimensional representation. Ex- 
actly one reading is necessary if nothing is cached, because only the compressed 
multidimensional array is kept on the disk. Everything else (the header, the 
dimension values, and so forth) is loaded into the memory in advance. With the 
table representation, more block readings may be needed because we also have 
to traverse through the B-tree first, and then we have to retrieve the necessary 
row from the table. 

M is also different for the two alternative physical representations. This is 
because two different algorithms are used to retrieve the same information from 
two different physical representations. 

Hence, for the above argument, we are going to introduce four constants. 

Definition 8. 

M m = the value of M for the multidimensional representation, 

M t — the value of M for the table representation, 
D m = the value of D for the multidimensional representation, 

D t — the value of D for the table representation. 

□ 

If we sample the cells/rows with uniform probability, we can then estimate 
the probabilities as follows: 

the number of cached pages 
^ the total size in pages ' 

q = I -p. 

By the 'total size' we mean that part of the physical representation which can be 
found on the disk at the beginning. In the multidimensional representation, it 
is the compressed multidimensional array, whereas in the table representation, 
we can put the entire size of the physical representation into the denominator 
of p. The cached pages are those that had been originally on the disk, but were 
moved into the memory later. In other words, the size of the cached blocks 
(numerator) is always smaller than or equal to the total size (denominator). 

The experiments show that the alternative physical representations differ 
from each other in size. That is why it seems reasonable to introduce four 
different probabilities in the following manner. 

Definition 9. 

p m = the value of p for the multidimensional representation, 
Pt — the value of p for the table representation, 

Qm 1 Pm j 

q t = 1-pt- 

□ 



3 In this section, just like in the previous one, we shall make the same assumption that 
every cell/row is sampled with the same probability. 
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When does the inequality below hold? This is an important question: 

nu) < E&). 

Here £ TO and £ t are random variables that are the retrieval times in the multi- 
dimensional and table representations, respectively. 

In our model, E(£,) = pjMj + qiDi (i e {m,t}). Thus the question can be 
rephrased as follows: 

P m M m + q m D m <p t M t + q t D t . 

The value of the M m , D mi M t and D t constants was measured by carrying 
out some experiments. (See the following section.) Two different results were 
obtained. For one benchmark database (TPC-D), the following was found: 

M t < M m < D m < D t . 

Another database (APB-1) gave a slightly different result: 

M m < M t < D m < D t . 

The M m <C I?™ and M t <C D m inequalities hold because disk operations are 
slower than memory operations by orders of magnitude. The third one (D m <C 
D t ) is because we have to retrieve more blocks from the table representation 
than from the multidimensional to obtain the same information. 

Note here that E(£j) is the convex linear combination of Mj and Z?j (pj, qi £ 
[0, 1] and i G {m,t}). In other words, E(^) can take any value from the closed 
interval [Mj, Di]. 

The following provides sufficient condition for E(£ m ) < E(£ t ): 

D m < Pt M t +q t D t . 

From this, we can obtain the inequality constraint: 

D m < p t Mt + (l-pt)D t , 
D t - D m 



Pt < 



Dt-Mt 



The value for %Z°M t was found to be 63.2%, 66.5% and 66.3% (for TPC-D, 
TPC-H and APB-1, respectively) in the experiments. This means that, based 
on the experimental results, the expected value of the retrieval time was smaller 
in the multidimensional representation than in the table representation when 
less than 63.2% of the latter one was cached. This was true regardless of the 
fact whether the multidimensional representation was cached or not. 

Now we are going to distinguish two cases based on the value of M m and 

M t . 

Case 1: M t < M m . This was true for the TPC-D benchmark database. 
(Here the difference sequence consisted of 16-bit unsigned integers, which re- 
sulted in a slightly more complicated decoding, as the applied Huffman decoder 
returns 8 bits at a time. This may be the reason why M m became larger than 
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M t .) In this case, we can give a sufficient condition for E(£ TO ) > E(£ t ), as the 
equivalent transformations below show: 

p t M t + q t D t < M m , 
p t M t + (1 - p t )D t < M m , 
D t -M n 



D t -M t 



< Pt- 



For we obtained a value of 99.9%. This means that the expected 

retrieval time was smaller in the table representation when more than 99.9% of it 
was cached. This was true even when the whole multidimensional representation 
was in the memory. 

Case 2: M m < M t . This inequality holds for the TPC-H and the APB-1 
benchmark databases. Here we can give another sufficient condition for E(£ m ) < 

PmM m + q m D m < M t , 
p m M m + (1 - p m )D m < M t , 
D m -M t 



D m - M n 



< Pm- 



The left hand side of the last inequality was equal to 99.9% and 98.3% for 
the TPC-H and APB-1 benchmark databases, respectively. In other words, 
when more than 99.9% of the multidimensional representation was cached, it 
then resulted in a faster operation on average than the table representation 
regardless of the caching level of the latter. 

Finally, let us give a necessary and sufficient condition for E(£ m ) < E(£ t ). 
First, let us consider the following equivalent transformations (making the nat- 
ural assumption that D t > M t ): 

nu) < e(6), (8) 

p m M m + q m D m < p t M t + q t D t , (9) 
p m M m + (1 - p m )D m < p t M t + (1 - p t )D t , (10) 

D m -M m D t -D m 
Pt < Dt-Mt Pm + ^M- t - 

The last inequality was the following for the three tested databases, TPC-D, 
TPC-H and APB-1, respectively: 

Pt < 0.368p m + 0.632, 
Pt < 0.335p m + 0.665, 
Pt < 0.343p m + 0.663. 

Theorem 6. Suppose that D t > M t . Then the expected retrieval time is smaller 
in the case of the multidimensional physical representation than in the table 
physical representation if and only if 

D m - M m D t - D m 



Pt < —R TT- P 



D t - M t " u D t - Mt 
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Proof. The truth of the theorem is a direct consequence of Equations (JSJ) - 

CD. ■ 

Now, let us change our model slightly. In this modified version, we shall 
assume that the different probabilities are (piecewise) linear functions of the 
memory size available. This assumption is in accordance with Theorems [5] and 
[5] With the multidimensional representation, the formula below follows from 
the model for the expected retrieval time: 

T m (x) = M m p m (x) + D m q m (x) = M rn p m (x) + D m (l -p rn (x)), 
T m {x) = (M m - D m )p m (x) + D m , 

where 

p m (x) = mm < — — — , 1 

H is the total size of the multidimensional representation part, which is loaded 
into the memory in advance (the header and the dimension values), C is the 
size of the compressed multidimensional array and x (=2 H) is the size of the 
available memory. 

In an analogous way, for the table representation, we obtain the formula: 
T t (x) = M tPt (x) + D t q t (x) = M tPt (x) + D t (l - p t (x)), 
T t (x) = (M t - D t )p t (x) + D u 

where 

p t (x) = min , 

S is the total size of the table representation and x 0) is the size of the 
memory available for caching. 

It is not hard to see that the global maximum and minimum values and 
locations of the functions T m (x) and T t (x) are the following: 



max{T m (a;) 


x Z H) 


= D m 


and 




— ^ in 


if and only if 


x = H, 


min{T m (a;) | 


x^H} 


= M m 


and 




= M m 


if and only if 


x^H + C, 


max{r t (x) 


\x ^ 0} 


= D t 


and 


Tt(x) 


= D t 


if and only if 


x = 0, 


min{T t (a;) 


| x > 0} = 


= M t 


and 


Tt(x) 


= M t 


if and only if 


x^S. 



Definition 10. For x ^ H values, let us define the speed-up factor in the 
following way: 

Tt(x) 



speed-up(x) 



T m (x) ' 

□ 



The global maximum of the speed-up factor can be achieved, when the entire 
multidimensional representation is cached into the memory. This is specified in 
the following theorem. 
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Theorem 7. Suppose that 

M t -D t ^ M m - D m M m - D m 
0> > — and 0< — H + D m < D t . (12) 

Then the global maximum of the speed-up(x) function can be found at C + H . 

Proof. The speed-up(x) function is continuous, because T t (x) and T m (x) are 
continuous and T m {x) ^ 0. Hence, to prove the theorem, it is enough to show 
that this function is strictly monotone increasing on interval (H, C+H), strictly 
monotone decreasing on (C + H, S) and constant on (S, oo). On the first 
interval, 



■up(x) 



(M t - D t )pt(x) + D t _ (Af t -A)f+A 



(M m - D m ) Pm (x) + D m (M m - D m )Z^t + D n 

For convenience, let us introduce the following notation: 

M t - D t 
at = — , 

h = D t , 
a 2 = ■ 



C 

, M m — D rn 

b 2 = ^ H + D m . 

The first derivative of the speed-up(x) function is 

, /aiaf + 6iV aib 2 -a 2 bi 

speed-up (x) — — = — ? . 

W \a 2 x + b 2 J (a 2 x + b 2 ) 2 

The first derivative is positive if and only if 0162 — 0261 > 0. Equation (fT2|) can 
be written as 

> ai > a 2 (13) 

and 

< b 2 < bi. (14) 
Let us multiply Equation (1131) by b%, Equation (|14[) by a\. Then we obtain that 

di&i > 0261 

and 

a\b 2 > a\b\. 

From the last two inequalities, we get that ai&2 > 02^1, which is equivalent with 
ai&2 ~ 02^1 > 0. Thus speed- up 1 (x) > and speed-up(x) is strictly monotone 
increasing on interval (H, C + H). 

Now, suppose that x G (C + H, S). In this case 



speed-up(x) 



(M t - Dt)pt(x) + D t _ {M t -Dt)§+Dt _ aix + h 



M m M m M„ 

The fist derivative is 

speed-up (x) = —— < 0, 

Mm 
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because a% < and M m > 0. So speed-up(x) is strictly monotone decreasing. 
Finally, let us take the case, when x £ (S, oo). The speed-up factor 

speed-up(x) = — -, 

which is constant. I 



The location of the global maximum is C + H . The global maximum value 
is obviously 

speed-up(C + H) = — = — 2 . 

Mm Mm 

As it will be described in details in the next section, experiments were made to 
determine the value of the constants. For these data, see Table [5] there. The 
sizes were also measured and can be seen in Table [TJ (in bytes) together with 
the global maximum locations and values per benchmark database. As it can 
be seen from the latter table, the speed-up can be very large, 2-3 orders of 
magnitude. The maximum value for the TPC-D benchmark database was more 
than 400, while for the APB-1 benchmark database, it was more than 1,500. 



Table 1: Global maximum of speed-up(x) 



Symbol 


TPC-D 


TPC-H 


APB-1 


S 


279,636,324 


1,419,181,908 


1,295,228,960 


c 


48,007,720 


239,996,040 


99,144,000 


H 


19,006,592 


154,024,844 


4,225,039 


C + H 


67,014,312 


394,020,884 


103,369,039 


speed-up(C + H) 


416 


1,066 


1,549 



We can draw the conclusions of this section as follows: 

• If (nearly) the entire physical representation is cached into the memory, 
then the complexity of the algorithm will determine the speed of retrieval. 
A less CPU-intensive algorithm will result in a faster operation. 

• In the tested cases, the expected retrieval time was smaller with mul- 
tidimensional physical representation when less than 63.2% of the table 
representation was cached. This was true regardless of the caching level 
of the multidimensional representation. 

• Depending on the size of the memory available for caching, the speed-up 
factor can be very large, up to 2-3 orders of magnitude! In other words, 
the caching effects of the alternative physical representations modify the 
results significantly. Hence these effects should always be taken into ac- 
count, when the retrieval time of the different physical representations are 
compared with each other. 
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5 Experiments 

We carried out experiments in order to measure the sizes of the different physical 
representations and the constants in the previous section. We also examined how 
the size of the cache influenced the speed of retrieval. 

Table [2] shows the hardware and software used for testing. The speed of 
the processor, the memory and the hard disk all influence the experimental 
results quite significantly, just like the memory size. In the computer industry, 
all of these parameters have increased quickly over the time. But the increase 
of the hard disk speed has been somewhat slower. Hence, it is expected that 
the results presented will remain valid despite the continuing improvement in 
computer technology. 



Table 2: Hardware and software used for testing 



Processor 

Memory 

Hard disk 

Filesystem 

Page size of B-tree 

Operating system 

Kernel version 

Compiler 

Programming language 
Free 



Intel Pentium 4 with HT technology, 2.6 GHz, 
800 MHz FSB, 512 KB cache 
512 MB, DDR 400 MHz 

Seagate Barracuda, 80 GB, 7200 RPM, 2 MB cache 
ReiserFS format 3.6 with standard journal 
4 KB 

SuSE Linux 9.0 (i586) 

2.4.21-99-smp4G 

gcc (GCC) 3.3.1 (SuSE Linux) 

C 

procps version 3.1.11 



In the experiments we made use of three benchmark databases: TPC-D 
[T5] . TPC-H [TO] and APB-1 [5]. One relation (i?) was derived per benchmark 
database in exactly the same way as was described in [T2] . Then these relations 
were represented physically with a multidimensional representation and table 
representation. 

Tables |3l |4] and [5] show that DHC results in a smaller multidimensional 
representation than difference sequence compression. (For TPC-H, the so-called 
Scale Factor was equal to 5. That is why the table representation of TPC-H is 
about five times greater than that of TPC-D.) 



Table 3: TPC-D benchmark database 



Compression 


Size in bytes 


Percentage 


Table representation 






Uncompressed 


279,636,324 


100.0% 


Multidimensional representation 






Difference sequence compression 


67,925,100 


24.3% 


Difference - Huffman coding 


67,014,312 


24.0% 
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Table 4: TPC-H benchmark database 



Compression 


Size in bytes 


Percentage 


Table representation 

Uncompressed 


1,419,181,908 


100.0% 


Multidimensional representation 

Difference sequence compression 
Difference - Huffman coding 


407,414,614 
394,020,884 


28.7% 
27.8% 


Table 5: APB-1 benchmark database 


Compression 


Size in bytes 


Percentage 


Table representation 

Uncompressed 


1,295,228,960 


100.0% 


Multidimensional representation 

Difference sequence compression 
Difference - Huffman coding 


113,867,897 
103,369,039 


8.8% 
8.0% 



In the rest of this section, we shall deal only with DHC. Its performance will 
be compared to the performance of the uncompressed table representation. 

In order to determine the constant values of the previous section, an ex- 
periment was performed. A random sample was taken with replacement from 
relation R with uniform distribution. The sample size was 1000. Afterwards 
the sample elements were retrieved from the multidimensional representation 
and then from the table representation. The elapsed time was measured to cal- 
culate the average retrieval time per sample element. Then the same sample 
elements were retrieved again from the two physical representations. Before the 
first round, nothing was cached. So the results help us to determine the con- 
stants D m and D t . Before the second round, every element of the sample was 
cached in both physical representations. So the times measured in the second 
round correspond to the values of the constants M m and M t . The results of the 
experiment can be seen in Table [SJ 



Table 6: Constants 





TPC-D 


TPC-H 


APB-1 


Symbol 


(ms) 


(ms) 


(ms) 


M m 


0.031 


0.014 


0.012 


M t 


0.021 


0.018 


0.128 


D m 


6.169 


7.093 


6.778 


D t 


16.724 


21.165 


19.841 



In the next experiment, we examined how the size of memory available for 
caching influenced the speed of retrieval. In Figures [3J 0] and El T m (x) is labelled 
as 'Array Est.,' T t (x) as 'Table Est.' The horizontal axis shows the size of the 
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Figure 3: The retrieval time for the TPC-D benchmark database as a function 
of the memory size available for caching 




100,000,000 200,000,000 300,000,000 



memory in bytes, while the vertical one displays the expected/average retrieval 
time in milliseconds. 

In order to verify the model with empirical data, we performed the following 
tests. Random samples were taken with replacement. The sample size was set 
at 300 in TPC-D and 100 in TPC-H and APB-1 in order to stay within the 
constraints of the physical memory. The average retrieval time was measured 
as well as the cache size used for each physical representation. In the multi- 
dimensional representation, the utilized cache size was corrected by adding H 
to it, as this representation requires that some parts of it are loaded into the 
memory in advance. Then the above sampling and measuring procedures were 
repeated another 99 times. That is, altogether 30,000 elements were retrieved 
from the TPC-D database, and 10,000 from TPC-H and APB-1. The average 
retrieval time, as a function of the cache size (or memory) used, can also be 
seen in Figures [2HH] The data relating to the multidimensional physical rep- 
resentation are labelled as 'Array,' and the data for the table representation as 
'Table.' 

The diagrams suggest that the model fits the empirical data quite well. Only 
the table representation of TPC-H and ABP-1 deviates slightly from it. 

The test results of the first ten passes and the last ten passes can be seen 
in Tables [7] and HI as well. Column A is the sequence number. Columns B-E 
correspond to TPC-D, columns F-I to TPC-H, while columns J-M are for 
APB-1. Columns B, F and J show the memory needed for the multidimensional 
representation, while columns C, G and K give the same for the table repre- 
sentation. The retrieval time with the multidimensional representation can be 
found in columns D, H and L, and the table representation in columns E, I and 
M. The 'memory used' values are strictly increasing. This can be attributed to 
the fact that increasingly larger parts of the physical representations are cached 
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Figure 4: The retrieval time for the TPC-H benchmark database as a function 
of the memory size available for caching 
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Figure 5: The retrieval time for the APB-1 benchmark database as a function 
of the memory size available for caching 
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into the memory. 

Looking at Tables [7]-|S] and Figures [2H1J it can be seen that the multidi- 
mensional representation was always significantly faster over the tested range. 



Table 7: Memory used (in 2 10 bytes) and retrieval time (in milliseconds) for the 
TPC-D and TPC-H benchmark databases 
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00 


4 


27,025 


27,420 


5 


58 


14 


36 


153 591 


12,652 


G 


43 


21 


01 


5 


28,841 


32,668 


5 


26 


14 


00 


154,367 


15,528 


6 


66 


19 


61 


6 


30,565 


37,896 


4 


83 


13 


88 


155,139 


18,328 


6 


23 


19 


63 


7 


32,113 


42,908 


4 


61 


13 


87 


155,919 


21,160 


6 


75 


18 


54 


8 


33,557 


47,684 


4 


60 


13 


92 


156,707 


23,992 


6 


67 


19 


14 


9 


34,949 


52,228 


4 


37 


12 


56 


157,463 


26,760 


6 


70 


18 


85 


10 


36,289 


56,792 


4 


12 


14 


58 


158,231 


29,456 


6 


53 


18 


55 


91 


63,609 


216,352 
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211,143 


193,868 
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217,228 
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218,060 
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5 
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94 
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218,784 
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3 


29 
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4 


93 


15 


29 


95 


63,813 


219,484 





28 


3 


31 


213,359 


200,584 


4 


82 


14 


95 


96 


63,841 


220,200 





34 


2 


82 


213,895 


202,164 


5 


56 


13 


67 


97 


63,857 


220,804 





13 


2 


78 


214,439 


203,760 


5 
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14 


48 


98 


63,905 


221,592 





30 


3 


23 


215,019 


205,464 


5 


34 


14 


54 


99 


63,925 


222,260 
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2 


94 


215,583 


207,140 


5 


43 


15 


57 
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63,949 


222,908 
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78 


216,099 
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5 


03 


14 
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Summarizing our experimental results, we may say that: 

• The size of DHC was smaller than that of the difference sequence com- 
pression. 

• With suitably designed experiments, we were able to measure the con- 
stants of the model proposed in the previous section. 

• We verified the model with empirical data. 

• Over the tested range of available memory, the multidimensional repre- 
sentation was always much quicker than the table representation in terms 
of retrieval time. 

6 Conclusion 

It often turns out that caching significantly improves response times. This 
was also found to be the case for us when the same relation was represented 
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Table 8: Memory used (in 2 10 bytes) and retrieval time (in milliseconds) for the 
APB-1 benchmark database 



A 


j 


K 




L 




M 
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7,204 
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21 


53 


3 


6,478 


10,312 


(j 


48 


19 


83 


4 


7,262 


13,452 


(j 


85 


20 


03 


5 


8,002 


16 328 


(j 


35 


19 
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(j 
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19 
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7 
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22 208 
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42 


19 


56 


8 


10,266 


25,076 


7 


02 


19 


23 


9 


10,978 


27,884 


6 


35 


19 


13 


10 


11,726 


30,664 


6 


68 


19 


92 


91 


52,334 


201,140 


3 


72 


13 


82 


92 


52,726 


202,836 


4 


46 


14 


86 


93 


53,046 


204,540 


3 


55 


14 


75 


94 


53,438 


206,240 


3 


98 


14 


52 


95 


53,754 


207,960 


3 


47 


15 


77 


96 


54,090 


209,516 


3 


82 


14 


12 


97 


54,382 


211,100 


3 


09 


14 


01 


98 


54,670 


212,660 


3 


13 


13 


53 


99 


55,054 


214,404 


3 


89 


14 


74 


100 


55,358 


216,144 


2 


97 


14 


83 



physically in different ways. In order to analyse this phenomenon, we proposed 
two models. 

In the first model, the dominance of the I/O cost rule was used to examine 
the caching effects. Uniform distribution was assumed for the analysis. We 
found that the expected number of pages brought into the memory is a linear 
function of the buffer cache size. And we know that the time to retrieve a 
cell/row from the database is proportional to the number of database pages 
copied from the disk into the memory. 

The second model was built in accordance with the findings of the first one. 
In the latter model, four constants were introduced for the retrieval time from 
the memory (M m and M t ) and from the disk (D m and D t ). It was necessary to 
have four symbols as we had to distinguish between the multidimensional rep- 
resentation (M m and D m ) and the table representation (M t and D t ). Based on 
the model, necessary and sufficient conditions were given for when one physical 
representation results in a lower expected retrieval time than the other. Actu- 
ally, with the tested benchmark databases, we found that the expected retrieval 
time was smaller with a multidimensional physical representation if less than 
63.2% of the table representation was cached. This was true regardless of the 
caching level of the multidimensional representation. 

We were able to infer from the second model that the complexity of the al- 
gorithm could determine the speed of retrieval when (nearly) the entire physical 
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representation was cached into the memory. A less CPU-intensive algorithm 
will probably result in a faster operation. It is important to mention that the 
first model is unable to explain this phenomenon. The reason for this is that the 
dominance of the 1/ cost rule ignores the time requirements of the memory 
operations. 

Using a slightly modified version of the second model, we investigated the 
speed-up factor, which can be achieved, if the multidimensional representation 
is used instead of the table one. We found that, depending on the memory size 
available for caching, the speed-up can be 2-3 orders of magnitude. That is 
why it is very important to also take into account the caching effects, when the 
performances of the different physical representations are compared. 

Experiments were performed to measure the constants of the model. We 
found that there was a big difference in values between M m and M t , as well as 
D m and D t . The difference of the first two constants can be accounted for by 
the different CPU-intensity of the algorithms. The reason why D m <C D t is that 
the multidimensional representation requires much less I/O operations than the 
table representation when one cell/row is retrieved. This latter observation is 
in line with the dominance of the I/O cost rule. However, instead of counting 
the number of I/O operations, we chose to determine the values of D m and D t 
from empirical data. 

We verified the model with additional experiments and found that the model 
fitted the experimental results quite well. There was only a slight difference with 
the table representation of the TPC-H and APB-1 benchmark databases. 

Finally, over the tested range of available memory, the multidimensional 
representation was always much faster than the table representation in terms of 
average retrieval time, as it can be seen in Figures [3]-[5l 

Based on the above results, we think, like Westmann et al. [18], that today's 
database systems should be extended with compression capabilities to improve 
their overall performance. 
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